@oleksandr-pavlyk
Contributor

```
In [1]: import dpctl.tensor as dpt, dpctl, dpctl.utils

In [2]: n, m = 8 * 540, 8 * 960

In [3]: a = dpt.ones((m, n))

In [4]: b = dpt.zeros((m, n))

In [5]: b_s = dpt.zeros((m, n+2))

In [6]: with dpctl.utils.onetrace_enabled():
   ...:     b_s[:,:-2] += a
   ...:
Device Timeline (queue: 0x556080b9cea0): zeCommandListAppendMemoryCopy(H2D)[48 bytes]<4.1> [ns] = 16946404661 (append) 16952292497 (submit) 16952613747 (start) 16952623538 (end)
Device Timeline (queue: 0x556080b9cea0): dpctl::tensor::kernels::add::add_inplace_strided_kernel<float, float, dpctl::tensor::offset_utils::TwoOffsets_StridedIndexer>[SIMD32 {64800; 1; 1} {512; 1; 1}]<5.1> [ns] = 17017855801 (append) 17018342202 (submit) 17019138920 (start) 17030770482 (end)
```
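The trace can be sanity-checked with a little arithmetic. The sketch below assumes that onetrace's `{64800; 1; 1}` is the work-group count and `{512; 1; 1}` the work-group size, and that the 48-byte host-to-device copy carries the packed shape plus the strides of both operands as 64-bit integers (as the `TwoOffsets_StridedIndexer` name suggests). Neither interpretation is stated in the PR; both are assumptions.

```python
# Figures copied from the onetrace output above.
n, m = 8 * 540, 8 * 960
elements = m * n  # elements updated by b_s[:, :-2] += a

# Assumption: {64800; 1; 1} = work-group count, {512; 1; 1} = work-group size.
groups, group_size = 64800, 512
work_items = groups * group_size

# Kernel start/end timestamps, in nanoseconds.
kernel_start, kernel_end = 17019138920, 17030770482
kernel_ms = (kernel_end - kernel_start) / 1e6

# Assumption: the 48-byte H2D copy holds one shape vector and two stride
# vectors (destination and source) of 8-byte integers, for a 2-D iteration space.
ndim, itemsize = 2, 8
packed_bytes = 3 * ndim * itemsize

print(elements, work_items, elements == work_items)
print(f"kernel time: {kernel_ms:.2f} ms")
print(f"packed metadata: {packed_bytes} bytes")
```

Under these assumptions the kernel launches exactly one work-item per element, and the only extra transfer is the small metadata copy.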

Before this change, two additional copy operations were also performed.
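In `b_s[:,:-2] += a` the destination view is also the left operand, so an overlap check that treats any shared memory as a hazard forces copies through a temporary. The PR does not spell out the new rule. The following is a minimal NumPy sketch, assuming an elementwise operation is safe when destination and source address exactly the same elements with the same offset, shape, strides and dtype, and otherwise falls back to an exact overlap test. `same_layout` and `needs_temporary` are hypothetical helpers, not dpctl's implementation.

```python
import numpy as np


def same_layout(x: np.ndarray, y: np.ndarray) -> bool:
    """True if x and y address exactly the same elements in the same order."""
    return (
        x.__array_interface__["data"][0] == y.__array_interface__["data"][0]
        and x.shape == y.shape
        and x.strides == y.strides
        and x.dtype == y.dtype
    )


def needs_temporary(dst: np.ndarray, src: np.ndarray) -> bool:
    """Hypothetical rule for an elementwise op that writes dst from src.

    Identical views are safe: each element is read and then written at the
    same position. Any other memory overlap is treated as a hazard that
    would require copying src to a temporary first.
    """
    if same_layout(dst, src):
        return False
    return bool(np.shares_memory(dst, src))


# Small stand-ins for the arrays in the session above.
b_s = np.zeros((4, 6), dtype=np.float32)
a = np.ones((4, 4), dtype=np.float32)

print(needs_temporary(b_s[:, :-2], b_s[:, :-2]))  # same view on both sides of +=
print(needs_temporary(b_s[:, :-2], a))            # separate buffers
print(needs_temporary(b_s[:, :-2], b_s[:, 2:]))   # shifted, partially overlapping view
```

Only the third case would still need a temporary under this rule.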

Previously:

```
In [7]: %time b_s[:,:-2] += a
CPU times: user 13.2 ms, sys: 24.7 ms, total: 37.9 ms
Wall time: 53 ms
```

Now:

```
In [7]: %time b_s[:,:-2] += a
CPU times: user 5.08 ms, sys: 9.58 ms, total: 14.7 ms
Wall time: 16.7 ms
```

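The two `%time` runs translate into the following speedups. Each figure comes from a single run, so the ratios are indicative only.

```python
# Timings reported by %time above, in milliseconds (single runs).
before = {"wall": 53.0, "cpu_total": 37.9}
after = {"wall": 16.7, "cpu_total": 14.7}

# Ratio > 1 means the new code path is faster.
speedup = {key: before[key] / after[key] for key in before}

for key, ratio in speedup.items():
    print(f"{key}: {ratio:.2f}x faster")
```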
  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you opening the PR as a draft?

@AlexanderKalistratov

Shouldn't it also fix `sqrt` with `out` for pairwise distance?

oleksandr-pavlyk force-pushed the improve-overlap-check-in-copy branch from 17a2623 to 701c05b, July 17, 2023 14:24
@github-actions

Array API standard conformance tests for dpctl=0.14.5dev1=py310h7bf5fec_11 ran successfully.
Passed: 448
Failed: 552
Skipped: 119

oleksandr-pavlyk merged commit a6d16f2 into master, July 17, 2023
oleksandr-pavlyk deleted the improve-overlap-check-in-copy branch, July 17, 2023 17:43
@github-actions

Deleted rendered PR docs from intelpython.github.com/dpctl, latest should be updated shortly. 🤞

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.14.5dev1=py310h7bf5fec_11 ran successfully.
Passed: 448
Failed: 552
Skipped: 119

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants